ABSTRACT

A parallel algorithm has perfect strong scaling if its running 
time on P processors is linear in 1/P, including all communication
costs. Distributed-memory parallel algorithms for
matrix multiplication with perfect strong scaling have only 
recently been found. One is based on classical matrix multi- 
plication (Solomonik and Demmel, 2011), and one is based 
on Strassen's fast matrix multiplication (Ballard, Demmel, 
Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale 
perfectly, but only up to some number of processors where 
the inter-processor communication no longer scales. 

We obtain a memory-independent communication cost 
lower bound on classical and Strassen-based distributed- 
memory matrix multiplication algorithms. These bounds 
imply that no classical or Strassen-based parallel matrix 
multiplication algorithm can strongly scale perfectly beyond 
the ranges already attained by the two parallel algorithms 
mentioned above. The memory-independent bounds and the 
strong scaling bounds generalize to other algorithms. 
1. INTRODUCTION

In evaluating the recently proposed parallel algorithm 
based on Strassen's matrix multiplication [2] and compar- 
ing the communication costs to the known lower bounds [3] , 
we found a gap between the upper and lower bounds for 
certain problem sizes. The main motivation of this work 
is to close this gap by tightening the lower bound for this 
case, proving that the algorithm is optimal in all cases, up 
to O(log P) factors. A similar scenario exists in the case of
classical matrix multiplication; in this work we provide the 
analogous tightening of the existing lower bound [5] to show 
optimality of another recently proposed algorithm [7]. 

In addition to proving optimality of algorithms, the lower 
bounds in this paper yield another interesting conclusion re- 
garding strong scaling. We say that an algorithm strongly 
scales perfectly if it attains running time on P processors 
which is linear in 1/P, including all communication costs. 
While it is possible for classical and Strassen-based ma- 
trix multiplication algorithms to strongly scale perfectly, the 
communication costs restrict the strong scaling ranges much 
more than do the computation costs. These ranges depend 
on the problem size relative to the local memory size, and 
on the computational complexity of the algorithm. 

Interestingly, in both cases the dominance of a memory- 
independent bound arises, and the strong scaling range ends, 
exactly when the memory-dependent latency lower bound 
becomes constant. This observation may provide a hint as to 
where to look for strong scaling ranges in other algorithms. 
Of course, since the latency cost cannot possibly drop be- 
low a constant, it is an immediate result of the memory- 
dependent bounds that the latency cost cannot continue to 
strongly scale perfectly. However the bandwidth cost typi- 
cally dominates the cost, and it is the memory-independent 
bandwidth scaling bounds that limit the strong scaling of 
matrix multiplication in practice. For simplicity we omit 
discussions of latency cost, since the number of messages is 
always a factor of M below the bandwidth cost in the strong 
scaling range, and is always constant outside the strong scal- 
ing range. 

While the main arguments in this work focus on matrix 
multiplication, we present results in such a way that they can 
be generalized to other algorithms, including other $O(n^3)$-based
dense and sparse algorithms as in [2] and other fast
matrix multiplication algorithms as in [3]. 

Our paper is organized as follows. In Section 2.1 we
prove a memory-independent communication lower bound
for Strassen-based matrix multiplication algorithms, and we
prove an analogous bound for classical matrix multiplication
in Section 2.2. We discuss the implications of these bounds
on strong scaling in Section 3 and compare the communication
costs of Strassen and classical matrix multiplication as
the number of processors increases. In Section 4 we discuss
generalization of our bounds to other algorithms. The main
results of this paper are summarized in Table 1.

2. COMMUNICATION LOWER BOUNDS 

We use the distributed-memory communication model 
(see, e.g., [4]), where the bandwidth-cost of an algorithm 
is proportional to the number of words communicated and 
the latency-cost is proportional to the number of messages 
communicated along the critical path. We will use the notation
that $n$ is the dimension of the matrices, $P$ is the number
of processors, $M$ is the local memory size of each processor,
and $\omega_0 = \log_2 7 \approx 2.81$ is the exponent of Strassen's matrix
multiplication.
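As a rough illustration of this cost model, the sketch below charges a fixed cost per word and per message along the critical path. The names `communication_time`, `min_messages`, `beta`, and `alpha` are hypothetical and introduced only for this example; the message-count helper additionally assumes that a single message can carry at most M words.

```python
import math

def communication_time(words, messages, beta=1.0, alpha=1.0):
    """Modeled communication time along the critical path.

    beta  : hypothetical time per word (bandwidth cost coefficient)
    alpha : hypothetical time per message (latency cost coefficient)
    """
    if words < 0 or messages < 0:
        raise ValueError("counts must be non-negative")
    return beta * words + alpha * messages

def min_messages(words, M):
    """Fewest messages needed to move `words` words.

    Assumption for illustration: a message holds at most M words,
    so the message count is at least words / M.
    """
    return math.ceil(words / M)
```

Under these assumptions, `communication_time(W, min_messages(W, M))` gives a simple lower estimate of the time needed to move W words.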

2.1 Strassen's Matrix Multiplication 

In this section, we prove a memory-independent lower
bound for Strassen's matrix multiplication of $\Omega(n^2/P^{2/\omega_0})$
words, where $\omega_0 = \log_2 7$. We reuse notation and proof techniques
from [3]. By prohibiting redundant computations we
mean that each arithmetic operation is computed by exactly
one processor. This is necessary for interpreting edge expansion
as communication cost.
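To make the exponent concrete, the following sketch counts the scalar multiplications performed by Strassen's recursion and compares the count with $n^{\omega_0}$. The function name `strassen_mult_count` is hypothetical, and the sketch assumes that $n$ is a power of two and that the recursion continues down to 1 x 1 blocks.

```python
import math

OMEGA_0 = math.log2(7)  # exponent of Strassen's algorithm, about 2.807

def strassen_mult_count(n):
    """Scalar multiplications used by Strassen's recursion on n x n inputs.

    Each level replaces one product of size n by 7 products of size n/2.
    Assumes n is a positive power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    return 1 if n == 1 else 7 * strassen_mult_count(n // 2)
```

For example, `strassen_mult_count(8)` is 7^3 = 343, compared with 8^3 = 512 scalar multiplications for the classical algorithm.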

Theorem 2.1. Suppose a parallel algorithm performing
Strassen's matrix multiplication minimizes computational
costs in an asymptotic sense and performs no redundant
computation. Then, for sufficiently large $P$,¹ some processor
must send or receive at least $\Omega\left(\frac{n^2}{P^{2/\omega_0}}\right)$ words.

PROOF. The computation DAG (see e.g., [2] for formal
definition) of Strassen's algorithm multiplying square matrices
$A \cdot B = C$ can be partitioned into three subgraphs: an
encoding of the elements of $A$, an encoding of the elements
of $B$, and a decoding of the scalar multiplication results to
compute the elements of $C$. These three subgraphs are connected
by edges that correspond to scalar multiplications.
Call the third subgraph $Dec_{\lg n}C$, where $\lg n = \log_2 n$ is the
number of levels of recursion for matrices of dimension $n$.

In order to minimize computational costs asymptotically,
the running time for Strassen's matrix multiplication must
be $O(n^{\omega_0}/P)$. Since a constant fraction of the flops correspond
to vertices in $Dec_{\lg n}C$, this is possible only if some
processor performs $\Theta(n^{\omega_0}/P)$ flops corresponding to vertices
in $Dec_{\lg n}C$.

By Lemma 10 of [3], the edge expansion of $Dec_k C$ is given
by $h(Dec_k C) = \Omega((4/7)^k)$. Using Claim 5 there (decomposition
into edge disjoint small subgraphs), we deduce that

$$h_s(Dec_{\lg n}C) = \Omega\left(\left(\frac{4}{7}\right)^{\log_7 s}\right), \qquad (1)$$

where $h_s$ is the edge expansion for sets of size at most $s$.

Let $S$ be the set of vertices of $Dec_{\lg n}C$ that correspond
to computations performed by the given processor. Set
$s = |S| = \Theta(n^{\omega_0}/P)$. By equation (1), the number of edges
between $S$ and its complement $\bar{S}$ is

$$|E(S,\bar{S})| = \Omega\left(s \cdot h_s(Dec_{\lg n}C)\right) = \Omega\left(\frac{n^2}{P^{2/\omega_0}}\right),$$

and because $Dec_{\lg n}C$ is of bounded degree (Fact 9 there)
and each vertex is computed by only one processor, the
number of words moved is $\Omega(|E(S,\bar{S})|)$ and the result follows. □
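The arithmetic behind the last step of the proof of Theorem 2.1 can be spelled out as follows. This is only a sketch of the calculation, taking $s$ proportional to $n^{\omega_0}/P$, the small-set expansion bound $\Omega((4/7)^{\log_7 s})$ of equation (1), and the identity $\log_7 4 = 2/\log_2 7 = 2/\omega_0$:

```latex
\begin{align*}
s \cdot h_s(Dec_{\lg n} C)
  &= \Omega\!\left( s \cdot \left(\tfrac{4}{7}\right)^{\log_7 s} \right)
   = \Omega\!\left( s \cdot s^{\log_7 4 - 1} \right)
   = \Omega\!\left( s^{\log_7 4} \right) \\
  &= \Omega\!\left( s^{2/\omega_0} \right)
   = \Omega\!\left( \left( \frac{n^{\omega_0}}{P} \right)^{2/\omega_0} \right)
   = \Omega\!\left( \frac{n^2}{P^{2/\omega_0}} \right).
\end{align*}
```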

2.2 Classical Matrix Multiplication 

In this section, we prove a memory-independent lower
bound for classical matrix multiplication of $\Omega(n^2/P^{2/3})$
words. The same result appears elsewhere in the literature,
under slightly different assumptions: in the LPRAM model
[1], where no data exists in the (unbounded) local memories
at the start of the algorithm; in the distributed-memory
model [5], where the local memory size is assumed to be
$M = \Theta(n^2/P^{2/3})$; and in the distributed-memory model [7],
where the algorithm is assumed to perform a certain amount
of input replication. Our bound is for the distributed-memory
model, holds for any $M$, and assumes no specific communication
pattern.

¹ Theorem 2.1 applies to any $P \ge 2$ with a strict enough
assumption on the load balance among vertices in $Dec_{\lg n}C$
as defined in the proof.

Recall the following special case of the Loomis-Whitney 
geometric bound: 

Lemma 2.2. [6] Let $V$ be a finite set of lattice points in
$\mathbb{R}^3$, i.e., points $(x,y,z)$ with integer coordinates. Let $V_x$ be
the projection of $V$ in the $x$-direction, i.e., all points $(y,z)$
such that there exists an $x$ so that $(x,y,z) \in V$. Define $V_y$
and $V_z$ similarly. Let $|\cdot|$ denote the cardinality of a set.
Then $|V| \le \sqrt{|V_x| \cdot |V_y| \cdot |V_z|}$.
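The inequality of Lemma 2.2 is easy to check numerically on small examples. The sketch below is only an illustration, not part of the proof; the helper names `projections` and `loomis_whitney_holds` and the two sample point sets are hypothetical. It compares squared quantities so that the check uses exact integer arithmetic.

```python
import random

def projections(V):
    """Return the three axis projections of a set of lattice points."""
    Vx = {(y, z) for (x, y, z) in V}
    Vy = {(x, z) for (x, y, z) in V}
    Vz = {(x, y) for (x, y, z) in V}
    return Vx, Vy, Vz

def loomis_whitney_holds(V):
    """Check |V|^2 <= |Vx| * |Vy| * |Vz|, the squared form of Lemma 2.2."""
    Vx, Vy, Vz = projections(V)
    return len(V) ** 2 <= len(Vx) * len(Vy) * len(Vz)

rng = random.Random(0)  # fixed seed so that the run is reproducible
sample = {(rng.randrange(6), rng.randrange(6), rng.randrange(6))
          for _ in range(80)}
cube = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
```

The inequality holds for both sets. For the full 4 x 4 x 4 cube it holds with equality, since 64^2 = 16 * 16 * 16.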



Using Lemma 2.2 (in a similar way to [4, 5]), we can describe
the ratio between the number of scalar multiplications
a processor performs and the amount of data it must access.

Lemma 2.3. Suppose a processor has $I$ words of initial
data at the start of an algorithm, performs $\Theta(n^3/P)$ scalar
multiplications within classical matrix multiplication, and
then stores $O$ words of output data at the end of the algorithm.
Then the processor must send or receive at least
$\Omega(n^2/P^{2/3}) - I - O$ words during the execution of the algorithm.

PROOF. We follow the proofs in [4, 5]. Consider a discrete
$n \times n \times n$ cube where the lattice points correspond to
the scalar multiplications within the matrix multiplication
$A \cdot B$ (i.e., lattice point $(i,j,k)$ corresponds to the scalar
multiplication $a_{ik} \cdot b_{kj}$). Then the three pairs of faces of the
cube correspond to the two input and one output matrices.

The projections on the three faces correspond to the
input/output elements the processor has to access (and
must communicate if they are not in its local memory).
By Lemma 2.2 and the fact that
$|V_x| \cdot |V_y| \cdot |V_z| \le (|V_x| + |V_y| + |V_z|)^3$, the number of
words the processor must access is at least
$|V|^{2/3} = \Omega(n^2/P^{2/3})$. Since the
processor starts with $I$ words and ends with $O$ words, the
result follows. □
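The counting argument in the proof of Lemma 2.3 can be illustrated on box-shaped sets of multiplications. The restriction to boxes is an assumption made here for simplicity (the lemma itself covers arbitrary sets), and the names `words_accessed` and `lower_bound` are hypothetical.

```python
def words_accessed(bx, by, bz):
    """Words touched by a processor owning a bx x by x bz box of the cube.

    Lattice point (i, j, k) stands for a_ik * b_kj, so the box projects onto
    a bx*bz piece of A, a bz*by piece of B, and a bx*by piece of C.
    """
    return bx * bz + bz * by + bx * by

def lower_bound(num_mults):
    """The |V|^(2/3) bound on accessed words from the proof of Lemma 2.3."""
    return num_mults ** (2.0 / 3.0)

# Two ways to give one of P = 64 processors n^3 / P = 4096 multiplications
# of an n = 64 problem.
cubic = words_accessed(16, 16, 16)  # 16 * 16 * 16 = 4096 multiplications
slab = words_accessed(64, 64, 1)    # 64 * 64 * 1  = 4096 multiplications
```

Both shapes respect the bound of about 256 words, but the cubic box touches 768 words while the slab touches 4224, which suggests why near-cubic blocks are the natural choice for communication-efficient classical algorithms.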

Theorem 2.4. Suppose a parallel algorithm performing
classical dense matrix multiplication begins with one copy
of the input matrices and minimizes computational costs in
an asymptotic sense. Then, for sufficiently large $P$,² some
processor must send or receive at least $\Omega\left(\frac{n^2}{P^{2/3}}\right)$ words.

Proof. At the end of the algorithm, every element of the 
output matrix must be fully computed and exist in some 
processor's local memory (though multiple copies of the
element may exist in multiple memories). For each output 
element, we designate one memory location as the output 
and disregard all other copies. For each of the n 2 designated 
memory locations, we consider the n scalar multiplications 
whose results were used to compute its value and disregard 
all other redundantly computed scalar multiplications. 

In order to minimize computational costs asymptotically,
the running time for classical dense matrix multiplication
must be $O(n^3/P)$. This is possible only if at least a constant
fraction of the processors perform $\Theta(n^3/P)$ of the scalar
multiplications corresponding to designated outputs.

Since there exists only one copy of the input matrices
and designated output ($O(n^2)$ words of data in total), some
processor which performs $\Theta(n^3/P)$ multiplications must start and
end with no more than $I + O = O(n^2/P)$ words of data.
Thus, by Lemma 2.3, some processor must read or write
$\Omega(n^2/P^{2/3}) - O(n^2/P) = \Omega(n^2/P^{2/3})$ words of data. □

² Theorem 2.4 applies to any $P \ge 2$ with a strict enough
assumption on the load balance.

Table 1: Bandwidth-cost lower bounds for matrix
multiplication and perfect strong scaling ranges.
The classical memory-dependent bound is due to [5],
and the Strassen memory-dependent bound is due
to [3]. The memory-independent bounds are proved
here, though variants of the classical bound appear
elsewhere in the literature (see Section 2.2).
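One way to read the final subtraction in the proof of Theorem 2.4 is the following sketch. The constants $c_1, c_2 > 0$ are names introduced here for illustration: $c_1 n^2/P^{2/3}$ stands for the bound of Lemma 2.3 and $c_2 n^2/P$ for the initial and final data held by the processor.

```latex
\[
c_1 \frac{n^2}{P^{2/3}} - c_2 \frac{n^2}{P}
  = \frac{n^2}{P^{2/3}} \left( c_1 - \frac{c_2}{P^{1/3}} \right)
  \ge \frac{c_1}{2} \cdot \frac{n^2}{P^{2/3}}
  \quad \text{whenever } P \ge \left( \frac{2 c_2}{c_1} \right)^{3}.
\]
```

Read this way, the requirement that $P$ be sufficiently large is what allows the subtracted term to be absorbed into the leading one.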

3. LIMITS OF STRONG SCALING 

In this section we present limits of strong scaling of matrix
multiplication algorithms. These are immediate implications
of the memory-independent communication lower
bounds proved in Section 2. Roughly speaking, the memory-dependent
communication-cost lower bound is of the form
$\Omega(f(n,M)/P)$ for both classical and Strassen matrix multiplication
algorithms. However, the memory-independent
lower bounds are of the form $\Omega(g(n)/P^c)$, where $g$ does not
depend on $M$ and $c < 1$ (see Table 1). This implies that perfect
strong scaling is not possible when the memory-independent
bound dominates. We make this formal below.

Corollary 3.1. Suppose a parallel algorithm performing
Strassen's matrix multiplication minimizes bandwidth and
computational costs in an asymptotic sense and performs
no redundant computation. Then the algorithm can achieve
perfect strong scaling only for $P = O\left(\frac{n^{\omega_0}}{M^{\omega_0/2}}\right)$.

Proof. By [3], any parallel algorithm performing matrix
multiplication based on Strassen moves at least
$\Omega\left(\frac{n^{\omega_0}}{P M^{\omega_0/2-1}}\right)$ words. By Theorem 2.1, a parallel algorithm
that minimizes computational costs and performs no redundant
computation moves at least $\Omega\left(\frac{n^2}{P^{2/\omega_0}}\right)$ words. This
latter bound dominates in the case $P = \Omega\left(\frac{n^{\omega_0}}{M^{\omega_0/2}}\right)$. Thus,
while a communication-optimal algorithm will strongly scale
perfectly up to this threshold, after the threshold the communication
cost will scale as $1/P^{2/\omega_0}$ rather than $1/P$. □
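The threshold in Corollary 3.1 comes from equating the memory-dependent and memory-independent bounds. As a sketch of the algebra, with constants dropped:

```latex
\begin{align*}
\frac{n^{\omega_0}}{P\, M^{\omega_0/2 - 1}} = \frac{n^2}{P^{2/\omega_0}}
  &\iff P^{1 - 2/\omega_0} = \frac{n^{\omega_0 - 2}}{M^{(\omega_0 - 2)/2}}
        = \left( \frac{n}{\sqrt{M}} \right)^{\omega_0 - 2} \\
  &\iff P^{(\omega_0 - 2)/\omega_0} = \left( \frac{n}{\sqrt{M}} \right)^{\omega_0 - 2}
   \iff P = \left( \frac{n}{\sqrt{M}} \right)^{\omega_0}
          = \frac{n^{\omega_0}}{M^{\omega_0/2}}.
\end{align*}
```

Substituting 3 for $\omega_0$ in the same calculation gives $P = n^3/M^{3/2}$, the threshold for the classical algorithm.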

Corollary 3.2. Suppose a parallel algorithm performing
classical dense matrix multiplication starts and ends with
one copy of the data and minimizes bandwidth and computational
costs in an asymptotic sense. Then the algorithm
can achieve perfect strong scaling only for $P = O\left(\frac{n^3}{M^{3/2}}\right)$.
Figure 1: Bandwidth costs and strong scaling of
matrix multiplication: classical vs. Strassen-based.
Horizontal lines correspond to perfect strong scaling.
$P_{\min}$ is the minimum number of processors required
to store the input and output matrices.



Proof. By [5], any parallel algorithm performing classical matrix
multiplication moves at least $\Omega\left(\frac{n^3}{P\sqrt{M}}\right)$ words. By Theorem 2.4,
a parallel algorithm that starts and ends with one
copy of the data and minimizes computational costs moves
at least $\Omega\left(\frac{n^2}{P^{2/3}}\right)$ words. This latter bound dominates in
the case $P = \Omega\left(\frac{n^3}{M^{3/2}}\right)$. Thus, while a communication-optimal
algorithm will strongly scale perfectly up to this
threshold, after the threshold the communication cost will
scale as $1/P^{2/3}$ rather than $1/P$. □
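The interplay of the two bounds in Corollaries 3.1 and 3.2 can be explored numerically. The sketch below drops all constant factors, so it shows only the asymptotic shape; the function names are hypothetical, and `omega` is 3 for the classical algorithm and $\log_2 7$ for Strassen.

```python
import math

OMEGA_0 = math.log2(7)  # exponent of Strassen's algorithm

def memory_dependent(n, M, P, omega=3.0):
    """Memory-dependent bandwidth bound n^omega / (P * M^(omega/2 - 1))."""
    return n ** omega / (P * M ** (omega / 2.0 - 1.0))

def memory_independent(n, P, omega=3.0):
    """Memory-independent bandwidth bound n^2 / P^(2/omega)."""
    return n ** 2 / P ** (2.0 / omega)

def bandwidth_lower_bound(n, M, P, omega=3.0):
    """The larger of the two bounds is the binding one."""
    return max(memory_dependent(n, M, P, omega),
               memory_independent(n, P, omega))
```

With n = 1024 and M = 2^14, the classical bounds coincide at P = 512 and the Strassen bounds at P = 343; below these values the memory-dependent bound is the larger one, and above them the memory-independent bound takes over.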

In Figure 1 we present the asymptotic communication
costs of classical and Strassen-based algorithms for a fixed 
problem size as the number of processors increases. Both 
of the perfectly strong scaling algorithms stop scaling per- 
fectly above some number of processors, which depends on 
the matrix size and the available local memory size. 

Let $P_{\min} = \Theta\left(\frac{n^2}{M}\right)$ be the minimum number of processors
required to store the input and output matrices. By
Corollaries 3.1 and 3.2 the perfect strong scaling range is
$P_{\min} \le P \le P_{\max}$, where $P_{\max} = \Theta(P_{\min}^{3/2})$ in the classical
case and $P_{\max} = \Theta(P_{\min}^{\omega_0/2})$ in the Strassen case.

Note that the perfect strong scaling range is larger for the 
classical case, though the communication costs are higher. 
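A short numerical sketch of the two strong scaling ranges follows. Constants are dropped and the helper name `strong_scaling_range` is hypothetical; the sample values n = 1024 and M = 2^14 are chosen only so that the results are round numbers.

```python
import math

OMEGA_0 = math.log2(7)  # exponent of Strassen's algorithm

def strong_scaling_range(n, M, omega=3.0):
    """Return (P_min, P_max) with constants dropped.

    P_min = n^2 / M and P_max = P_min^(omega / 2).
    """
    p_min = n ** 2 / M
    return p_min, p_min ** (omega / 2.0)

p_min, p_max_classical = strong_scaling_range(1024, 2 ** 14)
_, p_max_strassen = strong_scaling_range(1024, 2 ** 14, OMEGA_0)
```

In this example the range starts at 64 processors in both cases and extends to 512 processors for the classical algorithm but only to 343 for Strassen, in line with the remark that the classical range is larger.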

4. EXTENSIONS AND OPEN PROBLEMS 

The memory-independent bound and perfect strong scaling
bound of Strassen's matrix multiplication (Theorem 2.1
and Corollary 3.1) apply to other Strassen-like algorithms,
as defined in [4], with $\omega_0$ being the exponent of the total
arithmetic count, provided that $Dec_{\lg n}C$ is connected. The
proof follows that of Theorem 2.1 and of Corollary 3.1 but
uses Claim 18 of [3] instead of Fact 9 there, and replaces
Lemma 10 there with its extension.

The memory-dependent bound of classical matrix multiplication
of [5] was generalized in [4] to algorithms which
perform computations of the form given in equation (2),
where Mem(i) denotes the argument in memory location $i$
and $f_{ij}$ and $g_{ijk}$ are functions which depend non-trivially on
their arguments (see [4] for more detailed definitions).

The memory-independent bound of classical matrix multiplication
(Theorem 2.4) applies to these other algorithms
as well. If the algorithm begins with one copy of the input
data and minimizes computational costs in an asymptotic
sense, then, for sufficiently large $P$, some processor must
send or receive at least $\Omega\left(\left(\frac{G}{P}\right)^{2/3} - \frac{D}{P}\right)$ words, where $G$ is
the total number of $g_{ijk}$ computations and $D$ is the number
of nonzeros in the input and output. The proof follows that
of Lemma 2.3 and Theorem 2.4, setting $|V| = G$ (instead of
$n^3$), replacing $n^3/P$ with $G/P$, and setting $I + O = O(D/P)$
(instead of $O(n^2/P)$).
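The generalized bound can be evaluated directly once $G$ and $D$ are known. In the sketch below constants are dropped, the name `generalized_bound` is hypothetical, and flooring the result at zero is an assumption made here to reflect that the bound carries no information when $D/P$ exceeds $(G/P)^{2/3}$.

```python
def generalized_bound(G, D, P):
    """(G/P)^(2/3) - D/P with constants dropped, floored at zero.

    G : total number of g_ijk computations
    D : number of nonzeros in the input and output
    P : number of processors
    """
    return max(0.0, (G / P) ** (2.0 / 3.0) - D / P)

# Dense n x n matrix multiplication: G = n^3 and D = 3 * n^2.
n, P = 1024, 4096
dense = generalized_bound(n ** 3, 3 * n ** 2, P)
```

For dense matrix multiplication this recovers the shape of Theorem 2.4: with n = 1024 and P = 4096 the leading term is n^2 / P^(2/3) = 4096 and the subtracted term is 768.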

Algorithms which fit the form of equation (2) include LU
and Cholesky decompositions, sparse matrix-matrix mul- 
tiplication, as well as algorithms for solving the all-pairs- 
shortest-paths problem. Only a few of these have parallel 
algorithms which attain the lower bounds in all cases. In sev- 
eral cases, it seems likely that one can prove better bounds 
than those presented here, thus obtaining a stricter bound 
on perfect strong scaling. 

We also believe that our bounds can be generalized to QR 
decomposition and other orthogonal transformations, fast 
linear algebra, fast Fourier transform, and other recursive 
algorithms. 
$$\mathrm{Mem}(c(i,j)) = f_{ij}\big(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\big) \qquad (2)$$



